Abstract

An expression is proposed for determining the error made by neglecting
finite sample effects in entropy estimates. It is based on the Ansatz that
the ranked distribution of probabilities tends to follow a Zipf scaling.



1 Introduction

The growing interest in complexity measures and symbolic dynamics [?, ?] has
brought to the forefront various problems related to the estimation of entropic
quantities from finite sequences [?]. Such estimates are known to suffer from a
bias, which prevents quantities such as the metric entropy from being
meaningfully estimated. The purpose of this letter is to provide an analytical
expression for this bias, in order to test for finite sample effects in entropy
estimates.

Consider the general case of a string of $N$ symbols $\{i_1 i_2 \cdots i_N\}$, each
of which belongs to a finite alphabet $A$. The average informational content of
substrings of length $d$ taken from this sequence is expressed by the Shannon
entropy [?]

$$H_d = -\sum_{i_1,\ldots,i_d \in A} \mu([i_1 i_2 \cdots i_d]) \, \log \mu([i_1 i_2 \cdots i_d]) , \tag{1}$$

where $\mu$ is the natural invariant measure with respect to the shift. Of particular
interest is the block or dynamical Shannon entropy $h_d = H_{d+1} - H_d$, from which
one gets the measure-theoretic entropy of the system

$$h(\mu) = \lim_{d \to \infty} h_d , \tag{2}$$



a quantity that is intimately related to the Kolmogorov-Sinai entropy in case
the string represents the output of a shift dynamical system.

The main problem lies in the estimation of the empirical measure $\mu$ from a
finite string of symbols. Direct box counting yields

$$\mu([i_1 i_2 \cdots i_d]) \approx \frac{\#[i_1 i_2 \cdots i_d]}{N - d + 1} , \tag{3}$$

where $\#[i_1 i_2 \cdots i_d]$ is the occurrence frequency of the block $i_1 i_2 \cdots i_d$ in the
string. It is well known that statistical fluctuations in the sample on average
lead to a systematic underestimation of the entropy. This problem becomes 
particularly acute as the word size increases for a given string length N. Since 
this deviation can easily be mistaken for the signature of a finite memory process, 
it is of prime importance to determine whether its origin is physical or not. 
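
As an illustration of the direct box-counting estimate of eq. 3, the block entropies can be computed along the following lines. This is a minimal Python sketch rather than code from the letter itself; the function names `block_entropy` and `dynamical_entropy` are hypothetical, and natural logarithms are assumed.

```python
from collections import Counter
from math import log

def block_entropy(symbols, d):
    """Plug-in Shannon entropy H_d (in nats) of the overlapping length-d words."""
    n_words = len(symbols) - d + 1            # N - d + 1 words, as in eq. 3
    counts = Counter(tuple(symbols[i:i + d]) for i in range(n_words))
    return -sum((c / n_words) * log(c / n_words) for c in counts.values())

def dynamical_entropy(symbols, d):
    """Block (dynamical) entropy h_d = H_{d+1} - H_d."""
    return block_entropy(symbols, d + 1) - block_entropy(symbols, d)
```

For a fixed string length, the number of distinct words seen grows with `d` while each count shrinks, which is where the underestimation discussed above sets in.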

Several authors have already addressed the problem of making corrections to 
empirical entropy estimates [?]; their expressions are valid as long as the
occurrence frequencies of the observed words are large compared to one. While 
this may hold for relatively short words, it breaks down for long ones, making 
it difficult for a small correction to be used as a safe indication for a small 
deviation. Our objective is to derive a more reliable (although less accurate) 
expression of the deviation, to be used as a warning signal against the onset of 
finite sample effects. 

As a first guess one could require the sample to be long enough for each 
word to have a chance to appear. This gives $N \gg N_{\mathrm{symb}}^d$, where $N_{\mathrm{symb}}$ is the
cardinality of the alphabet. This criterion, however, is generally found to be too 
conservative because it does not take into account the grammar, i.e. the rules 
that cause some words to be forbidden or less frequent than others. 

2 The Zipf-ordered distribution 

To derive our expression, we first rank the words according to their frequency 
of occurrence: let $n_{k=1}$ denote the frequency of occurrence of the most probable
word, $n_{k=2}$ that of the next most probable one, etc. Multiple instances of the same
frequency get consecutive ranks. This monotonically decreasing distribution is 
called Zipf-ordered. 
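
The ranking step can be sketched as follows. This is an illustrative Python fragment under the same assumptions as eq. 3 (overlapping words); the name `zipf_ordered_counts` is hypothetical.

```python
from collections import Counter

def zipf_ordered_counts(symbols, d):
    """Occurrence counts n_k of the length-d words, sorted by decreasing frequency.

    Element k-1 of the returned list is n_k; ties simply occupy consecutive ranks.
    """
    n_words = len(symbols) - d + 1
    counts = Counter(tuple(symbols[i:i + d]) for i in range(n_words))
    return sorted(counts.values(), reverse=True)
```

The length of the returned list is the number of distinct words observed in the sample.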

The Asymptotic Equipartition Property introduced by Shannon states 
that the ensemble of words of length d can be divided into two subsets. The 
first one consists of "typical words" that occur frequently and roughly have 
the same probability of occurrence. The other subset is made of "rare words" 
that belong to the tail of the distribution. According to the
Shannon-McMillan-Breiman theorem, the entropy is related to the typical words in the
limit where $N \to \infty$; the contribution of rare words progressively disappears as $N$
increases. In some sense this observation justifies the procedure to be described 
below. 



It was noted by Pareto [?], Zipf and others, and later interpreted by
Mandelbrot [?], that the tail of the Zipf-ordered distribution $n_k$ tends to follow
a universal scaling law

$$n_k = \alpha k^{-\gamma} , \qquad \gamma > 0 , \tag{4}$$

which is found with astonishing reproducibility in economics, social sciences,
physics, etc. [10]. As shown in [11, 12], many different systems give rise to Zipf
laws, whose ubiquity is thought to be essentially a consequence of the ranking 
procedure. 

The physical meaning of Zipf's law is still an unsettled question, although 
it does not seem to reflect any particular self-organization (see for example
[13, ?]). We just mention that a slow decay is an indication of a "rich vocabulary",
in the sense that rare words occur relatively often. 

The key point is that the empirical Zipf-ordered distribution has a cutoff at
some finite value $k = N_{\max}$ because of the finite length of the symbol string.
For the same reason, the occurrence frequencies are necessarily quantized. Our
main hypothesis is that the true distribution extends beyond $N_{\max}$, up to the
lexicon size $K > N_{\max}$, following Zipf's law with the same exponent $\gamma$. This
Ansatz has already been suggested as a way to estimate entropies from long
words [?].



3 Estimating the bias 

Let $\hat{H}$ be the Shannon entropy computed from the empirical distribution (using
eqs. 1 and 3) and $H$ the entropy one would obtain from a nontruncated
distribution, in which the frequencies are no longer quantized and extend beyond
$N_{\max}$ following Zipf's law:

$$\hat{H} = -\sum_{k=1}^{N_{\max}} \frac{n_k}{\sum_{k'=1}^{N_{\max}} n_{k'}} \log \frac{n_k}{\sum_{k'=1}^{N_{\max}} n_{k'}} , \qquad H = -\sum_{k=1}^{K} \frac{n_k}{\sum_{k'=1}^{K} n_{k'}} \log \frac{n_k}{\sum_{k'=1}^{K} n_{k'}} . \tag{5}$$

The truncation has two counteracting effects. It changes the renormalization 
of the occurrence frequencies and causes some of the least frequent words to be 
omitted. 

The difference $\delta$ between the two entropy estimates

$$\delta = H - \hat{H} \tag{6}$$

is what we call the bias, to be used as a measure of the deviation resulting from
finite sample effects. We shall assume that $N_{\max} \gg 1$, which is equivalent to
saying that the distribution must have a sufficiently long tail for a power law to 
make sense. 



It is natural to define a small parameter $0 < \varepsilon \ll 1$, which goes to zero for
a nontruncated distribution:

$$\varepsilon = \frac{1}{N} \sum_{k=N_{\max}+1}^{K} n_k . \tag{7}$$

Remember that $N = \sum_{k=1}^{N_{\max}} n_k$ [?].



Now, assuming that Zipf's law persists for $k > N_{\max}$, we have

$$\varepsilon = \frac{1}{N} \sum_{k=N_{\max}+1}^{K} \alpha k^{-\gamma} = \frac{\alpha}{N} \left( \zeta(\gamma, N_{\max}+1) - \zeta(\gamma, K+1) \right) , \tag{8}$$

where $\zeta(\gamma, m)$ is the Hurwitz or generalized Riemann zeta function. For $m > \gamma$,
the following approximation holds [17]

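$$\zeta(\gamma, m) = \frac{m^{1-\gamma}}{\gamma - 1} + \frac{m^{-\gamma}}{2} + \frac{\gamma \, m^{-\gamma-1}}{12} + \cdots . \tag{9}$$

Since $K, N_{\max} \gg 1$, we may write

$$\varepsilon = \frac{\alpha}{N(\gamma - 1)} \left( N_{\max}^{1-\gamma} - K^{1-\gamma} \right) . \tag{10}$$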
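
The quality of the leading-order approximation behind eq. 10 can be checked numerically. The sketch below compares the exact tail sum of $k^{-\gamma}$ with its leading-order estimate; the function names and the sample values of $N_{\max}$, $K$ and $\gamma$ in the usage note are illustrative assumptions, not values from the letter.

```python
def tail_sum_exact(n_max, k_lex, gamma):
    """Direct evaluation of sum_{k=n_max+1}^{k_lex} k**(-gamma)."""
    return sum(k ** -gamma for k in range(n_max + 1, k_lex + 1))

def tail_sum_approx(n_max, k_lex, gamma):
    """Leading-order estimate (n_max**(1-gamma) - k_lex**(1-gamma)) / (gamma - 1).

    Not defined at gamma == 1, where the limit is log(k_lex / n_max).
    """
    return (n_max ** (1.0 - gamma) - k_lex ** (1.0 - gamma)) / (gamma - 1.0)
```

With, for instance, `n_max = 200`, `k_lex = 5000` and `gamma = 1.5`, the two values agree to well within one percent, and the agreement improves as `n_max` grows.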

The value of $\alpha$ remains to be determined. To do so, we note that the least
frequent words in the Zipf-ordered distribution occur once or a few times only.
One may therefore reasonably set $n_{k=N_{\max}} \approx 1$, giving $\alpha \approx N_{\max}^{\gamma}$.

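The bias $\delta$ can now be expanded in powers of $\varepsilon$. Keeping terms of order
$O(\varepsilon)$ only, we have

$$\delta = -\varepsilon \hat{H} + (1 + \varepsilon) \left( \varepsilon - \sum_{k=N_{\max}+1}^{K} \frac{n_k}{N} \log \frac{n_k}{N} \right) . \tag{11}$$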
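
The origin of the terms in eq. 11 can be made explicit with a short calculation. The following is a sketch under the definitions of eqs. 5 to 7; the shorthand $S$ for the tail sum is introduced here for convenience and is not part of the original notation. It uses $\sum_{k=1}^{K} n_k = N(1+\varepsilon)$.

```latex
\begin{aligned}
S &= -\sum_{k=N_{\max}+1}^{K} \frac{n_k}{N} \log \frac{n_k}{N} , \\
H &= -\sum_{k=1}^{K} \frac{n_k}{N(1+\varepsilon)} \log \frac{n_k}{N(1+\varepsilon)}
   = \frac{\hat{H} + S}{1+\varepsilon} + \log(1+\varepsilon) , \\
\delta &= H - \hat{H}
   = \varepsilon - \varepsilon \hat{H} + S + O(\varepsilon^2, \varepsilon S) .
\end{aligned}
```

Since $S$ is itself proportional to $\varepsilon$ up to logarithmic factors, the prefactor multiplying the bracket in eq. 11 only affects terms beyond first order.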

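For the conditions stated before, the sum can be approximated by

$$\sum_{k=N_{\max}+1}^{K} \frac{n_k}{N} \log \frac{n_k}{N} \approx -\varepsilon \left( \log N - \gamma \log \frac{N_{\max}}{K} \right) , \tag{12}$$

finally giving the result of interest

$$\delta = \varepsilon \left( 1 + \log N - \hat{H} - \gamma \log \frac{N_{\max}}{K} \right) , \qquad \varepsilon = \frac{N_{\max}}{N(\gamma - 1)} \left[ 1 - \left( \frac{N_{\max}}{K} \right)^{\gamma - 1} \right] . \tag{13}$$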

Notice that the true entropy is always underestimated; furthermore $\varepsilon$ is
continuous at $\gamma = 1$ [?]. Most of the variation comes from the small parameter $\varepsilon$,
whose expression reveals two different effects:

1. the ratio $N_{\max}/N$ reflects the uncertainty of the frequency estimates.

2. the scaling index $\gamma$, whose value is usually between 0.5 and 1.5, is indicative
of the lacunarity of the word distribution. In the case of a shift dynamical
system, $\gamma$ reveals how unevenly the rare orbits fill the phase space.
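
A numerical evaluation of the bias might look as follows. This is a hedged Python sketch, not the author's code: it assumes $\alpha \approx N_{\max}^{\gamma}$ in eq. 10, natural logarithms, an unnormalized $\hat{H}$ in nats, and that the logarithmic term of eq. 13 is evaluated with the ratio $N_{\max}/K$. The name `zipf_bias` and its argument names are hypothetical.

```python
from math import log

def zipf_bias(n, n_max, k_lex, gamma, h_hat):
    """Return (epsilon, delta) for string length n, n_max observed words,
    lexicon size k_lex, Zipf exponent gamma and empirical entropy h_hat (nats)."""
    ratio = n_max / k_lex
    if abs(gamma - 1.0) < 1e-9:
        # limit gamma -> 1 of the general expression below
        eps = (n_max / n) * log(1.0 / ratio)
    else:
        eps = n_max / (n * (gamma - 1.0)) * (1.0 - ratio ** (gamma - 1.0))
    # assumed form of the bias: eps * (1 + log N - H_hat - gamma * log(N_max / K))
    delta = eps * (1.0 + log(n) - h_hat - gamma * log(ratio))
    return eps, delta
```

Both returned values vanish when `k_lex == n_max`, that is, when no words are missing from the sample.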

For the sake of comparison, the first order approximation for finite sample effects
derived in [?, ?] is

$$\delta = \frac{N_{\max}}{2N} . \tag{14}$$

We conclude from eq. 13 that the bias is not just related to statistical
fluctuations in the empirical occurrence frequency, but is also caused by the omission
of words that are asymptotically rare. If the true distribution of the ranked 
words were exponential or ultimately ended with an exponential tail, then our 
criterion would be too conservative but still reliable as such. 

The following procedure is proposed for detecting the maximum word length
for which entropies can be meaningfully estimated: compute Zipf-ordered
distributions for increasing word lengths $d$. For each length, estimate the bias $\delta$
by least-squares fitting a power law to the tail of the observed distribution. As
soon as this bias exceeds a given threshold (say 10% of $\hat{H}$), entropies
computed from longer words are likely to be significantly corrupted by finite sample
effects.
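
The fitting step of this procedure can be sketched as below. The letter does not specify how the fit is carried out, so this sketch assumes an ordinary least-squares fit of $\log n_k$ against $\log k$; the names `fit_zipf_exponent` and `exceeds_threshold`, and the parameter `k_min` marking the start of the tail, are hypothetical.

```python
from math import log

def fit_zipf_exponent(counts, k_min):
    """Least-squares slope of log n_k versus log k over ranks k >= k_min.

    `counts` is the Zipf-ordered list (counts[k-1] = n_k); returns gamma.
    """
    ranks = range(k_min, len(counts) + 1)
    xs = [log(k) for k in ranks]
    ys = [log(counts[k - 1]) for k in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx          # n_k ~ k**(-gamma), so gamma is minus the slope

def exceeds_threshold(delta, h_hat, threshold=0.10):
    """True when the bias is larger than the chosen fraction of the empirical entropy."""
    return delta > threshold * h_hat
```

In practice the choice of `k_min` matters, since the head of the distribution generally does not follow the power law; the quantized tail (many words with a count of one) also flattens the fit.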

Equation 13 supposes that the maximum lexicon size $K$ is known a priori,
which is seldom the case. This is not a serious handicap, however, since the
value of $K$ has relatively little impact on the bias; a rough approximation such
as $K = N_{\mathrm{symb}}^d$ may do well.

4 Two examples 

To briefly illustrate the results, we now consider two examples. The first one 
is based on a Bernoulli process, whose entropy and Zipf-ordered distribution
can be calculated analytically. The string of symbols is drawn from a two-letter
alphabet, one letter with probability $\lambda$ and the other with probability $1 - \lambda$. The
block entropy of this process is independent of the word length and equals

$$h = -\lambda \log \lambda - (1 - \lambda) \log (1 - \lambda) . \tag{15}$$
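
For this process the true Zipf-ordered distribution follows from elementary counting: a word of length $d$ containing $j$ copies of the letter with probability $\lambda$ has probability $\lambda^j (1-\lambda)^{d-j}$, and there are $\binom{d}{j}$ such words. A minimal Python sketch (function names are hypothetical, natural logarithms assumed):

```python
from math import comb, log

def bernoulli_entropy(lam):
    """Block entropy of eq. 15, in nats."""
    return -lam * log(lam) - (1 - lam) * log(1 - lam)

def bernoulli_zipf(lam, d):
    """Probabilities of all 2**d words of length d, sorted in decreasing order."""
    probs = []
    for j in range(d + 1):                 # j letters of probability lam
        probs.extend([lam ** j * (1 - lam) ** (d - j)] * comb(d, j))
    return sorted(probs, reverse=True)
```

Summing $-p \log p$ over the list returned by `bernoulli_zipf(lam, d)` gives $d$ times `bernoulli_entropy(lam)`, which is a convenient consistency check.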

Figure 1 compares the true block entropy with estimates drawn from a
sample of length $N = 2000$ with $\lambda = 0.15$. The departure of the empirical estimate
from the true one is evident. Without knowledge of the true entropy, however, 
it is very difficult to tell whether the decrease of the entropy is an artifact or 
just the signature of a short-time memory. 

The second panel displays the true and the empirical Zipf-ordered
distributions as obtained for words of length $d = 9$. Zipf's law clearly holds for words
whose rank exceeds about 30. After this, the scaling exponent $\gamma$ is estimated,
see the third panel. The decrease of this exponent with the word length $d$
suggests that the contribution of the rare words becomes increasingly important.
Finally, the bias $\delta$, which is shown in the fourth panel, suggests that the onset
of a significant bias occurs around $d = 8$; this value is indeed in agreement with
the results of the first panel. 

The validity of the bias estimate was tested on various examples and was 
found to be reliable, provided that $N_{\max} \gg 1$.

In the second example, we consider a sequence of $N = 10^4$ symbols generated
by the logistic map $x_{i+1} = \lambda x_i (1 - x_i)$ in a chaotic regime with $\lambda = 3.8$. The
(generating) partition $\mathcal{P} = \{[0, 0.5[, [0.5, 1]\}$ gives us a two-letter alphabet.
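
Such a symbol sequence can be generated as in the following sketch. The initial condition `x0` and the number of discarded transient iterations `n_transient` are assumptions made for illustration; the letter does not state them. The function name is hypothetical.

```python
def logistic_symbols(n, lam=3.8, x0=0.3, n_transient=1000):
    """Binary symbol string from the logistic map x -> lam * x * (1 - x).

    Symbol 0 is emitted when x lies in [0, 0.5[ and symbol 1 when x lies in [0.5, 1].
    """
    x = x0
    for _ in range(n_transient):           # discard the initial transient
        x = lam * x * (1.0 - x)
    symbols = []
    for _ in range(n):
        symbols.append(0 if x < 0.5 else 1)
        x = lam * x * (1.0 - x)
    return symbols
```

With $\lambda = 3.8$ the orbit stays above roughly 0.18 once it has settled, so any point below 0.5 is mapped above 0.5 and the word 00 never occurs. This is a simple instance of the grammar that forbids some words.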

Figure 2 again shows that the block entropy decreases above a certain word 
length. In contrast to the previous example, the measured scaling exponent $\gamma$ is
small and almost constant, regardless of the word length. We believe this to be a
consequence of the intricate structure of the self-similar attractor. This low value
of $\gamma$ already suggests that rare words should bring a significant contribution to
the entropy. The bias $\delta$ finally suggests stopping at $d = 12$.

Figure 1: Analysis of a Bernoulli sequence, with $N = 2000$ and $\lambda = 0.15$.
From top to bottom: (1) the empirical block entropy and the true one (dashed);
(2) the true (line) and observed (dots) Zipf-ordered distributions for words of
length $d = 8$; (3) the scaling exponent $\gamma$ obtained by fitting the tail of the
Zipf-ordered distribution (error bars represent ±1 standard deviation resulting from
the least-squares fit); (4) the bias $\delta$. In this case, entropies cannot be reliably
estimated for word lengths beyond $d \approx 9$. Block entropies are normalized to
$\log N_{\mathrm{symb}}$, so that the maximum possible value is 1.

Figure 2: Analysis of a logistic map sequence, with the same legend as the
previous figure; the string length is $N = 10^4$. The second panel shows a
Zipf-ordered distribution for $d = 18$. The largest word size for which the relative
bias is smaller than 10% is $d = 12$.



5 Conclusion 

Summarizing, we have derived a simple expression (eq. 13) for detecting the
onset of finite sample size effects in entropy estimates. It is based on the
empirical evidence that rank-ordered distributions of words tend to follow Zipf's
law. The criterion reveals that rare events can significantly bias the empirical 
entropy estimate. 